From Graph to Result
Based on the batch-level data, we can see that ‘Dave’ – and apparently only Dave – has perfect R2 values on every batch of data he reviewed throughout the month of May. Digging deeper will require merging information from the batch level with information at the sample (and possibly peak) level.
```r
all_samples <- dir_ls(here("data"), glob = "*_s.csv") %>%
  map_dfr(read_csv) %>%
  clean_names()
```
```r
daves_data <- all_samples %>%
  left_join(select(all_batches, -calibration_slope, -calibration_intercept)) %>%
  filter(
    batch_collected_timestamp > ymd("2017-04-20"),
    batch_collected_timestamp < ymd("2017-06-10"),
    sample_type == "standard",
    reviewer_name == "Dave"
  )
```
The following plots of daves_data provide compelling evidence for what happened: Dave deselected the middle five calibrators in order to force a straight line and maximize the R2 term.
```r
daves_data %>%
  ggplot(aes(x = batch_collected_timestamp, y = used_for_curve, color = compound_name)) +
  geom_point() +
  facet_grid(compound_name ~ expected_concentration) +
  geom_vline(xintercept = as.numeric(as_datetime(c("2017-05-01", "2017-06-01"))),
             linetype = 1,
             colour = "black") +
  theme(axis.text.x = element_text(angle = 90))
```

```r
daves_data <- daves_data %>%
  mutate(pct_diff = (concentration - expected_concentration) / expected_concentration,
         within_20 = abs(pct_diff) <= 0.2)
## mutate: new variable 'pct_diff' (double) with 5,318 unique values and 14% NA
##         new variable 'within_20' (logical) with 3 unique values and 14% NA
```
```r
daves_data %>%
  filter(compound_name == "codeine") %>%
  ggplot(aes(x = batch_collected_timestamp, y = pct_diff, color = within_20)) +
  geom_point() +
  facet_wrap(~ expected_concentration) +
  ggtitle("Codeine Only") +
  geom_vline(xintercept = as.numeric(as_datetime(c("2017-05-01", "2017-06-01"))),
             linetype = 1,
             colour = "black")
## filter: removed 5,740 rows (83%), 1,148 rows remaining
```

The second plot shows that calibrators were dropped regardless of whether they fell within 20% of the expected concentration, suggesting that they were dropped for some other reason. The data do not say why ‘Dave’ did this, but a couple of plausible explanations revolve around training.
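This pattern can be checked directly by cross-tabulating the curve-inclusion flag against the within-20% flag: if calibrators were dropped only for accuracy reasons, the excluded points should concentrate in the `within_20 == FALSE` cell. A minimal sketch with toy data standing in for daves_data (the real `used_for_curve` and `within_20` columns are defined above):

```r
library(dplyr)

# Toy stand-in for daves_data; real data would come from the pipeline above
toy <- tibble(
  used_for_curve = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE),
  within_20      = c(TRUE, TRUE, TRUE,  FALSE, TRUE,  FALSE)
)

# Dropped-but-accurate calibrators (FALSE, TRUE) point to a non-accuracy reason
toy %>% count(used_for_curve, within_20)
```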
We intentionally included several other issues within the database, which will require aggregation and plotting to discover.
Exercise: Revealing problems based on ion ratios
Ion ratios can be particularly sensitive to instrument conditions, and their variability is a significant problem in mass spec-based assays that use qualifier ions. With the tools demonstrated in this course, we can look for outlier spikes and stability trends, and separate them out across instruments, compounds, or sample types. Within the one year of data provided, identify any potential issues that might suggest problems with workflows, training, or other factors that could affect the quality of the results.
This is a very open-ended exercise, so consider the following areas to explore:
- Consider all of the qualitative data elements that could influence the observed ion ratios: visualize the data as a function of combinations of these variables
- Time-based trending can make abrupt changes very obvious
- Data from all three file types (batches, samples, and peaks) are important in isolating issues
- When there are many data points to visualize, consider aggregation and/or visualizations that represent statistical summaries or fits
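The aggregation idea in the last bullet can be sketched as a week-binned group summary. Toy data stands in for the joined samples/batches table; the column names (`batch_collected_timestamp`, `compound_name`, `ion_ratio`) follow the datasets used in this exercise:

```r
library(dplyr)
library(lubridate)

# Toy stand-in for a joined samples + batches table
toy <- tibble(
  batch_collected_timestamp = ymd_hms("2017-05-01 08:00:00") + days(c(0, 1, 8, 9)),
  compound_name = "oxycodone",
  ion_ratio = c(2.0, 2.1, 1.5, 1.4)
)

# Bin timestamps into weeks, then summarise within groups
weekly <- toy %>%
  mutate(week = week(batch_collected_timestamp)) %>%
  group_by(compound_name, week) %>%
  summarise(mean_ratio = mean(ion_ratio), n = n(), .groups = "drop")

weekly
```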
```r
# need to bring in all peaks data in addition to batches and samples
all_peaks <- dir_ls(here("data"), glob = "*_p.csv") %>%
  map_dfr(read_csv, col_types = cols()) %>%
  clean_names()

all_samples %>%
  left_join(all_batches) %>%
  ggplot(aes(x = batch_collected_timestamp, y = ion_ratio, color = compound_name)) +
  geom_smooth() +
  theme(axis.text.x = element_text(angle = 90)) +
  facet_grid(compound_name ~ instrument_name) # doc is grossly out of step
## Joining, by = c("batch_name", "compound_name")
## left_join: added 9 columns (instrument_name, calibration_slope, calibration_intercept, calibration_r2, batch_passed, …)
##            > rows only in x    0
##            > rows only in y    (0)
##            > matched rows      2,262,312 (includes duplicates)
##            > =========
##            > rows total        2,262,312
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
```

```r
# look in more detail
# calculate weekly average peak area by compound and instrument
# exclude internal standard compounds
mean_by_week <- all_peaks %>%
  left_join(all_batches) %>%
  filter(!str_detect(compound_name, "-")) %>%
  filter(peak_area > 0) %>%
  mutate(week = week(batch_collected_timestamp)) %>%
  group_by(compound_name, instrument_name, week, chromatogram_name) %>%
  summarise(mean = mean(peak_area), sd = sd(peak_area), n = n())
## Joining, by = c("batch_name", "compound_name")
## left_join: added 9 columns (instrument_name, calibration_slope, calibration_intercept, calibration_r2, batch_passed, …)
##            > rows only in x    4,489,680
##            > rows only in y    (0)
##            > matched rows      4,524,624 (includes duplicates)
##            > =========
##            > rows total        9,014,304
## filter: removed 4,489,680 rows (50%), 4,524,624 rows remaining
## filter: removed 2,014,945 rows (45%), 2,509,679 rows remaining
## mutate: new variable 'week' (double) with 53 unique values and 0% NA
## group_by: 4 grouping variables (compound_name, instrument_name, week, chromatogram_name)
## summarise: now 4,452 rows and 7 columns, 3 group variables remaining (compound_name, instrument_name, week)
```
```r
ggplot(mean_by_week, aes(x = week, y = mean, color = chromatogram_name)) +
  geom_line() +
  geom_point() +
  geom_smooth() +
  facet_grid(compound_name ~ instrument_name, scales = "free_y") # quant is constant, qual drops
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
```
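One way to quantify the "quant is constant, qual drops" pattern the smoother suggests is to fit a line per chromatogram and inspect the slopes. A sketch with toy weekly means that mimic the pattern (not the real data):

```r
library(dplyr)

# Toy weekly means: quant flat, qual declining, mirroring the plotted pattern
toy <- tibble(
  week = rep(1:10, 2),
  chromatogram_name = rep(c("quant", "qual"), each = 10),
  mean = c(rep(100, 10), 100 - 5 * (1:10))
)

# Slope of mean peak area vs week for each chromatogram; a clearly
# negative slope flags the declining qual signal
slopes <- toy %>%
  group_by(chromatogram_name) %>%
  summarise(slope = coef(lm(mean ~ week))[2], .groups = "drop")

slopes
```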

```r
all_samples %>%
  left_join(all_batches) %>%
  filter(instrument_name != "doc") %>%
  ggplot(aes(x = batch_collected_timestamp, y = ion_ratio, color = compound_name)) +
  geom_smooth() +
  facet_grid(compound_name ~ instrument_name) +
  theme(axis.text.x = element_text(angle = 90)) # grumpy+oxycodone looks least like the others
## Joining, by = c("batch_name", "compound_name")
## left_join: added 9 columns (instrument_name, calibration_slope, calibration_intercept, calibration_r2, batch_passed, …)
##            > rows only in x    0
##            > rows only in y    (0)
##            > matched rows      2,262,312 (includes duplicates)
##            > =========
##            > rows total        2,262,312
## filter: removed 334,152 rows (15%), 1,928,160 rows remaining
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
```

```r
all_samples %>%
  left_join(all_batches) %>%
  filter(compound_name == "oxycodone" & instrument_name != "doc") %>%
  ggplot(aes(x = batch_collected_timestamp, y = ion_ratio, color = instrument_name)) +
  ggtitle("oxycodone") +
  geom_smooth() # grumpy+oxycodone clearly outlying
## Joining, by = c("batch_name", "compound_name")
## left_join: added 9 columns (instrument_name, calibration_slope, calibration_intercept, calibration_r2, batch_passed, …)
##            > rows only in x    0
##            > rows only in y    (0)
##            > matched rows      2,262,312 (includes duplicates)
##            > =========
##            > rows total        2,262,312
## filter: removed 1,940,952 rows (86%), 321,360 rows remaining
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
```

```r
all_samples %>%
  left_join(all_batches) %>%
  filter(compound_name == "oxycodone" & instrument_name != "doc" & ion_ratio > 0) %>%
  ggplot(aes(x = batch_collected_timestamp, y = ion_ratio, color = instrument_name)) +
  ggtitle("oxycodone") +
  #geom_point() +
  geom_smooth() # excluding ion_ratio == 0 makes it even more clear
## Joining, by = c("batch_name", "compound_name")
## left_join: added 9 columns (instrument_name, calibration_slope, calibration_intercept, calibration_r2, batch_passed, …)
##            > rows only in x    0
##            > rows only in y    (0)
##            > matched rows      2,262,312 (includes duplicates)
##            > =========
##            > rows total        2,262,312
## filter: removed 2,085,040 rows (92%), 177,272 rows remaining
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
```

```r
mean_by_week <- all_peaks %>%
  left_join(all_batches) %>%
  filter(compound_name == "oxycodone" & instrument_name == "grumpy" & peak_area > 0) %>%
  mutate(week = week(batch_collected_timestamp)) %>%
  group_by(week, chromatogram_name) %>%
  summarise(mean = mean(peak_area), sd = sd(peak_area), n = n())
## Joining, by = c("batch_name", "compound_name")
## left_join: added 9 columns (instrument_name, calibration_slope, calibration_intercept, calibration_r2, batch_passed, …)
##            > rows only in x    4,489,680
##            > rows only in y    (0)
##            > matched rows      4,524,624 (includes duplicates)
##            > =========
##            > rows total        9,014,304
## filter: removed 8,954,306 rows (99%), 59,998 rows remaining
## mutate: new variable 'week' (double) with 53 unique values and 0% NA
## group_by: 2 grouping variables (week, chromatogram_name)
## summarise: now 106 rows and 5 columns, one group variable remaining (week)
```
```r
ggplot(mean_by_week, aes(x = week, y = mean, color = chromatogram_name)) +
  geom_line() +
  geom_point() +
  geom_smooth() +
  ggtitle("oxycodone + grumpy") +
  facet_grid(chromatogram_name ~ ., scales = "free_y") # quant is constant, qual drops
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
```

```r
mean_by_week %>%
  mutate(sd = NULL, n = NULL) %>%
  spread(chromatogram_name, mean) %>%
  mutate(ion_ratio = quant / qual) %>%
  ggplot(aes(x = week, y = ion_ratio)) +
  geom_line() +
  geom_smooth() +
  ggtitle("ion_ratio by week for oxycodone on grumpy") # basically recreate step 4
## mutate (grouped): no changes
## spread: reorganized (chromatogram_name, mean) into (qual, quant) [was 106x3, now 53x3]
## mutate (grouped): new variable 'ion_ratio' (double) with 53 unique values and 0% NA
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
```
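To turn this visual inspection into a screen, one option is to flag large week-over-week changes in the weekly ion ratio. A sketch with toy data; the 20% threshold is an arbitrary assumption, not a value from the exercise:

```r
library(dplyr)

# Toy weekly ion ratios with one abrupt drop at week 6
toy <- tibble(
  week = 1:10,
  ion_ratio = c(2.0, 2.1, 2.0, 2.05, 1.95, 1.2, 1.25, 1.2, 1.3, 1.25)
)

# Flag weeks whose ratio changes by more than 20% from the previous week
flagged <- toy %>%
  mutate(pct_change = (ion_ratio - lag(ion_ratio)) / lag(ion_ratio)) %>%
  filter(abs(pct_change) > 0.2)

flagged
```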

```r
mean_by_week_reviewer <- all_peaks %>%
  left_join(all_batches) %>%
  filter(!str_detect(compound_name, "-")) %>%
  filter(peak_area > 0) %>%
  mutate(week = week(batch_collected_timestamp)) %>%
  group_by(compound_name, reviewer_name, week, chromatogram_name) %>%
  summarise(mean = mean(peak_area), sd = sd(peak_area), n = n())
## Joining, by = c("batch_name", "compound_name")
## left_join: added 9 columns (instrument_name, calibration_slope, calibration_intercept, calibration_r2, batch_passed, …)
##            > rows only in x    4,489,680
##            > rows only in y    (0)
##            > matched rows      4,524,624 (includes duplicates)
##            > =========
##            > rows total        9,014,304
## filter: removed 4,489,680 rows (50%), 4,524,624 rows remaining
## filter: removed 2,014,945 rows (45%), 2,509,679 rows remaining
## mutate: new variable 'week' (double) with 53 unique values and 0% NA
## group_by: 4 grouping variables (compound_name, reviewer_name, week, chromatogram_name)
## summarise: now 3,924 rows and 7 columns, 3 group variables remaining (compound_name, reviewer_name, week)

ggplot(mean_by_week_reviewer, aes(x = week, y = mean, color = chromatogram_name)) +
  geom_line() +
  geom_point() +
  geom_smooth() +
  facet_grid(compound_name ~ reviewer_name, scales = "free_y")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
```

```r
# quants run by Walter show a dip
mean_by_week_Walter <- all_peaks %>%
  left_join(all_batches) %>%
  filter(reviewer_name == "Walter") %>%
  filter(!str_detect(compound_name, "-")) %>%
  filter(peak_area > 0) %>%
  mutate(week = week(batch_collected_timestamp)) %>%
  group_by(compound_name, instrument_name, week, chromatogram_name) %>%
  summarise(mean = mean(peak_area), sd = sd(peak_area), n = n())
## Joining, by = c("batch_name", "compound_name")
## left_join: added 9 columns (instrument_name, calibration_slope, calibration_intercept, calibration_r2, batch_passed, …)
##            > rows only in x    4,489,680
##            > rows only in y    (0)
##            > matched rows      4,524,624 (includes duplicates)
##            > =========
##            > rows total        9,014,304
## filter: removed 8,878,896 rows (98%), 135,408 rows remaining
## filter: no rows removed
## filter: removed 75,251 rows (56%), 60,157 rows remaining
## mutate: new variable 'week' (double) with 10 unique values and 0% NA
## group_by: 4 grouping variables (compound_name, instrument_name, week, chromatogram_name)
## summarise: now 792 rows and 7 columns, 3 group variables remaining (compound_name, instrument_name, week)
```
```r
ggplot(mean_by_week_Walter, aes(x = week, y = mean, color = chromatogram_name)) +
  geom_line() +
  geom_point() +
  geom_smooth() +
  facet_grid(compound_name ~ instrument_name, scales = "free_y")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
```

The facets show a downward trend in quant peak areas over time, across instruments, for batches reviewed by Walter.

End of Exercise